Recent studies have shown that CLIP achieves remarkable success in zero-shot inference, while its fine-tuning performance is reported to be unsatisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine various key hyper-parameters and empirically evaluate their impact when fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP is substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and recent works that use CLIP as the prediction target in masked image modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP. We will release our code publicly at \url{https://github.com/LightDXY/FT-CLIP}.
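As a rough illustration of the fine-tuning setup the paper studies, the sketch below fine-tunes a CLIP image encoder end-to-end for ImageNet-1K classification. The abstract does not disclose the refined hyper-parameters, so the optimizer settings are placeholders, and the open_clip API (create_model_and_transforms, model.visual) is an assumption.

```python
# Minimal sketch: full fine-tuning of a CLIP image tower with a new linear head.
# Hyper-parameter values are illustrative placeholders, not the paper's choices.
import torch
import torch.nn as nn
import open_clip  # assumed available; API may differ across versions

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
visual = model.visual                                   # keep only the image encoder

with torch.no_grad():                                   # infer the feature dimension
    feat_dim = visual(torch.zeros(1, 3, 224, 224)).shape[-1]
head = nn.Linear(feat_dim, 1000)                        # new ImageNet-1K classifier head

params = list(visual.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=3e-5, weight_decay=0.05)  # placeholder values
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

def train_step(images, labels):
    """One supervised fine-tuning step over a mini-batch."""
    logits = head(visual(images))
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```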
This paper presents a 3D generative model that uses diffusion models to automatically generate 3D digital avatars represented as neural radiance fields. A significant challenge in generating such avatars is that the memory and processing costs in 3D are prohibitive for producing the rich details required for high-quality avatars. To tackle this problem, we propose the roll-out diffusion network (Rodin), which represents a neural radiance field as multiple 2D feature maps and rolls out these maps into a single 2D feature plane within which we perform 3D-aware diffusion. The Rodin model brings much-needed computational efficiency while preserving the integrity of diffusion in 3D by using 3D-aware convolution that attends to projected features in the 2D feature plane according to their original relationship in 3D. We also use latent conditioning to orchestrate the feature generation for global coherence, leading to high-fidelity avatars and enabling their semantic editing based on text prompts. Finally, we use hierarchical synthesis to further enhance details. The 3D avatars generated by our model compare favorably with those produced by existing generative techniques. We can generate highly detailed avatars with realistic hairstyles and facial hair such as beards. We also demonstrate 3D avatar generation from image or text, as well as text-guided editability.
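A minimal sketch of the roll-out idea under stated assumptions: three axis-aligned feature planes (a tri-plane representation of the radiance field) are concatenated into a single 2D feature plane so standard 2D diffusion backbones can process them, and a deliberately simplified cross-plane convolution lets each plane see pooled summaries of the other two. The paper's actual 3D-aware convolution attends to the projected features themselves; the version here is only a crude stand-in.

```python
import torch
import torch.nn as nn

def roll_out(triplane):
    """triplane: [B, 3, C, H, W] -> single plane [B, C, H, 3*W]."""
    b, p, c, h, w = triplane.shape
    return triplane.permute(0, 2, 3, 1, 4).reshape(b, c, h, p * w)

class NaiveCrossPlaneConv(nn.Module):
    """Lets each plane see a pooled summary of the other two planes (simplified)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, triplane):                        # [B, 3, C, H, W]
        outs = []
        for i in range(3):
            plane = triplane[:, i]
            others = [triplane[:, j].mean(dim=(-2, -1), keepdim=True).expand_as(plane)
                      for j in range(3) if j != i]
            outs.append(self.conv(torch.cat([plane] + others, dim=1)))
        return torch.stack(outs, dim=1)

# Example: a 64x64 tri-plane with 32 channels rolls out to a 64x192 feature plane.
planes = torch.randn(2, 3, 32, 64, 64)
rolled = roll_out(planes)                               # [2, 32, 64, 192]
refined = NaiveCrossPlaneConv(32)(planes)               # [2, 3, 32, 64, 64]
```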
Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts segmentation performance, especially for rare object categories. Although more diverse, higher-quality object instances yield larger gains from Copy-Paste, previous works obtain object instances either from human-annotated instance segmentation datasets or by rendering 3D object models, and both approaches are too expensive to scale up to good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable. To make this possible, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves gains of +2.6 box AP and +2.1 mask AP on all classes, and even larger gains of +6.8 box AP and +6.5 mask AP on long-tail classes.
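A minimal sketch of the core Copy-Paste operation that X-Paste scales up: compositing an object crop and its binary mask onto a background image at a random location. The upstream steps (generating instances with a text2image model or filtering crawled images with CLIP) are not shown, and the function names are illustrative.

```python
import numpy as np

def paste_instance(background, obj, obj_mask, rng=None):
    """background: [H, W, 3] uint8; obj: [h, w, 3] uint8; obj_mask: [h, w] bool."""
    if rng is None:
        rng = np.random.default_rng()
    H, W, _ = background.shape
    h, w, _ = obj.shape
    assert h <= H and w <= W, "instance must fit inside the background"

    top = rng.integers(0, H - h + 1)                  # random paste location
    left = rng.integers(0, W - w + 1)

    out = background.copy()
    region = out[top:top + h, left:left + w]
    region[obj_mask] = obj[obj_mask]                  # hard (alpha = 1) composite

    full_mask = np.zeros((H, W), dtype=bool)          # instance mask in image coordinates
    full_mask[top:top + h, left:left + w] = obj_mask
    return out, full_mask
```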
In this work, we are dedicated to text-guided image generation and propose a novel framework, i.e., CLIP2GAN, that leverages the CLIP model and StyleGAN. The key idea of CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN, which is realized by introducing a mapping network. In the training stage, we encode an image with CLIP and map the output feature to a latent code, which is further used to reconstruct the image. In this way, the mapping network is optimized in a self-supervised manner. In the inference stage, since CLIP can embed both image and text into a shared feature embedding space, we replace the CLIP image encoder in the training architecture with the CLIP text encoder, while keeping the subsequent mapping network as well as the StyleGAN model. As a result, we can flexibly input a text description to generate an image. Moreover, by simply adding the mapped text features of an attribute to a mapped CLIP image feature, we can effectively edit the attribute into the image. Extensive experiments demonstrate the superior performance of our proposed CLIP2GAN compared to previous methods.
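A minimal sketch of the mapping network and its self-supervised training step, assuming 512-dimensional CLIP features and StyleGAN latents; clip_encode and stylegan_generate are placeholders for the frozen pretrained models, and the MLP depth and reconstruction loss are illustrative choices.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """MLP that maps a CLIP embedding to a StyleGAN latent code (dims assumed)."""
    def __init__(self, clip_dim=512, latent_dim=512, hidden=1024, depth=4):
        super().__init__()
        dims = [clip_dim] + [hidden] * (depth - 1) + [latent_dim]
        layers = []
        for i in range(depth):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < depth - 1:
                layers.append(nn.LeakyReLU(0.2))
        self.net = nn.Sequential(*layers)

    def forward(self, clip_feature):
        return self.net(clip_feature)

def train_step(mapper, optimizer, images, clip_encode, stylegan_generate, recon_loss):
    """One self-supervised step: image -> CLIP feature -> latent -> reconstructed image."""
    with torch.no_grad():                      # the CLIP encoder stays frozen
        feats = clip_encode(images)
    latents = mapper(feats)
    recon = stylegan_generate(latents)         # frozen generator; gradients flow to mapper
    loss = recon_loss(recon, images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, clip_encode would be swapped for the CLIP text encoder while the mapper and generator are kept unchanged, mirroring the procedure described in the abstract.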
This paper presents a simple yet effective framework, MaskCLIP, which incorporates a newly proposed masked self-distillation into contrastive language-image pre-training. The core idea of masked self-distillation is to distill the representation of a full image into the representation predicted from a masked image. This incorporation enjoys two vital benefits. First, masked self-distillation targets local patch representation learning, which is complementary to vision-language contrastive learning that focuses on text-related representations. Second, masked self-distillation is also consistent with the vision-language contrastive training objective, since the visual encoder is used for feature alignment in both, and it is thus able to learn local semantics with indirect supervision from the language. We provide specially designed experiments and comprehensive analysis to validate these two benefits. Empirically, we show that when MaskCLIP is applied to various challenging downstream tasks, it achieves superior results in linear probing, fine-tuning, and zero-shot settings with the guidance of the language encoder.
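A minimal sketch of a masked self-distillation objective in the spirit described above: a teacher encodes the full image, a student encodes the masked image, and the student is trained to predict the teacher's patch features at the masked positions. The cosine-style loss and the detached (e.g., EMA) teacher are assumptions, and the CLIP contrastive branch is omitted.

```python
import torch
import torch.nn.functional as F

def masked_self_distillation_loss(student_feats, teacher_feats, mask):
    """student_feats, teacher_feats: [B, N, D] patch features;
    mask: [B, N] bool, True where a patch was masked out of the student's input."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1).detach()   # no gradient to the teacher
    per_patch = 1.0 - (s * t).sum(dim=-1)             # cosine distance per patch
    mask = mask.float()
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)
```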
We propose bootstrapped masked autoencoders (BootMAE), a new approach to vision BERT pre-training. BootMAE improves upon the original masked autoencoder (MAE) with two core designs: 1) a momentum encoder that provides online features as an extra BERT prediction target; 2) a target-aware decoder that tries to reduce the pressure on the encoder to memorize target-specific information. The first design is motivated by the observation that using a pre-trained MAE to extract features as the BERT prediction target for masked tokens can achieve better pre-training performance. Therefore, we add a momentum encoder in parallel with the original MAE encoder, which bootstraps the pre-training performance by using its own representations as the BERT prediction target. In the second design, we feed target-specific information (e.g., the pixel values of unmasked patches) directly into the decoder to reduce the pressure on the encoder to memorize it. Thus, the encoder focuses on semantic modeling, which is the goal of BERT pre-training, and does not need to waste its capacity memorizing information about unmasked tokens that is only relevant to the prediction target. Through extensive experiments, our BootMAE achieves 84.2% Top-1 accuracy on ImageNet-1K, an improvement of +0.8% under the same pre-training epochs. BootMAE also obtains a +1.0 mIoU improvement on ADE20K semantic segmentation, and improvements of +1.3 box AP and +1.4 mask AP for object detection and segmentation on the COCO dataset. The code is released at https://github.com/lightdxy/bootmae.
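A minimal sketch of the momentum-encoder update that supplies the bootstrapped prediction target: the momentum encoder receives no gradients and tracks the online encoder. The exponential-moving-average update is the usual way a momentum encoder is realized and is an assumption here, as is the momentum value.

```python
import torch

@torch.no_grad()
def momentum_update(online_encoder, momentum_encoder, m=0.999):
    """EMA update: momentum params <- m * momentum params + (1 - m) * online params."""
    for p_online, p_momentum in zip(online_encoder.parameters(),
                                    momentum_encoder.parameters()):
        p_momentum.data.mul_(m).add_(p_online.data, alpha=1.0 - m)
```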
Compared with generative adversarial networks (GANs), denoising diffusion probabilistic models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the \emph{de facto} GAN-based approaches, which may lead to unsatisfactory quality or diversity of the generated images. In this paper, we propose a novel framework based on DDPMs for semantic image synthesis. Unlike previous conditional diffusion models that feed the semantic layout and the noisy image together as input to a U-Net structure, which may not fully exploit the information in the input semantic mask, our framework processes the semantic layout and the noisy image differently: it feeds the noisy image to the encoder of the U-Net structure, while the semantic layout is fed to the decoder through multi-layer spatially-adaptive normalization operators. To further improve the generation quality and semantic interpretability in semantic image synthesis, we introduce a classifier-free guidance sampling strategy, which acknowledges the score of an unconditional model during the sampling process. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our proposed method, achieving state-of-the-art performance in terms of fidelity (FID) and diversity (LPIPS).
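A minimal sketch of the two ingredients highlighted above, under stated assumptions: a spatially-adaptive normalization operator through which the semantic layout modulates decoder features, and the classifier-free guidance combination applied to the noise predictions at sampling time. The layer sizes and the GroupNorm choice are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAdaptiveNorm(nn.Module):
    """SPADE-style normalization: the layout predicts per-pixel scale and shift."""
    def __init__(self, feat_channels, label_channels, hidden=128):
        super().__init__()
        # assumes feat_channels is divisible by 32
        self.norm = nn.GroupNorm(32, feat_channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1),
                                    nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, x, segmap):              # x: [B, C, H, W]; segmap: one-hot layout
        segmap = F.interpolate(segmap, size=x.shape[-2:], mode="nearest")
        h = self.shared(segmap)
        return self.norm(x) * (1 + self.to_gamma(h)) + self.to_beta(h)

def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Guided noise prediction: push the conditional prediction away from the
    unconditional one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```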
In this paper, we present the Intra- and Inter-Human Relation Network (I^2R-Net) for multi-person pose estimation. It involves two basic modules. First, the intra-human relation module operates on a single person and aims to capture intra-human dependencies. Second, the inter-human relation module considers the relations among multiple instances and focuses on capturing inter-human interactions. The inter-human relation module can be made very lightweight by reducing the resolution of the feature maps, yet it learns useful relation information that significantly improves the performance of the intra-human relation module. Even without bells and whistles, our method can compete with or outperform current competition winners. We conduct extensive experiments on the COCO, CrowdPose, and OCHuman datasets. The results demonstrate that the proposed model surpasses all state-of-the-art methods. Specifically, the proposed method achieves 77.4% AP on the CrowdPose dataset and 67.8% AP on the OCHuman dataset, outperforming existing methods by a large margin. In addition, ablation studies and visualization analyses also demonstrate the effectiveness of our model.
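A rough sketch of an inter-human relation module consistent with the description: per-person feature maps are downsampled to keep the module lightweight, flattened into tokens, and self-attention is applied across all persons so each instance can attend to the others. The token layout, attention configuration, and residual refinement are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterHumanRelation(nn.Module):
    """Attention across all persons in an image over downsampled feature maps."""
    def __init__(self, channels, num_heads=4, downsample=4):
        super().__init__()
        self.downsample = downsample
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, person_feats):           # [B, P, C, H, W]: P persons per image
        b, p, c, h, w = person_feats.shape     # assumes H, W divisible by `downsample`
        small = F.avg_pool2d(person_feats.flatten(0, 1), self.downsample)
        hs, ws = small.shape[-2:]
        tokens = small.flatten(2).transpose(1, 2).reshape(b, p * hs * ws, c)
        tokens = self.norm(tokens)
        rel, _ = self.attn(tokens, tokens, tokens)        # attention across persons
        rel = rel.reshape(b * p, hs, ws, c).permute(0, 3, 1, 2)
        rel = F.interpolate(rel, size=(h, w), mode="bilinear", align_corners=False)
        return person_feats + rel.reshape(b, p, c, h, w)  # residual refinement
```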
Masked image modeling (MIM) learns representations with remarkably good fine-tuning performance, overshadowing previously prevalent pre-training approaches such as image classification, instance contrastive learning, and image-text alignment. In this paper, we show that the inferior fine-tuning performance of these pre-training approaches can be significantly improved by simple post-processing in the form of feature distillation (FD). Feature distillation converts the old representations into new representations that have a few desirable properties, just like those produced by MIM. These properties, which we collectively refer to as optimization friendliness, are identified and analyzed with a set of attention- and optimization-related diagnostic tools. With these properties, the new representations show strong fine-tuning performance. Specifically, contrastive self-supervised learning methods become as competitive in fine-tuning as the state-of-the-art masked image modeling (MIM) algorithms. The fine-tuning performance of CLIP models is also significantly improved, with a CLIP ViT-L model reaching 89.0% Top-1 accuracy on ImageNet-1K classification. On the 3-billion-parameter SwinV2-G model, the fine-tuning accuracy on ADE20K semantic segmentation is improved by +1.5 mIoU to 61.4 mIoU, setting a new record. More importantly, our work provides a way for future research to focus more effort on the generality and scalability of learned representations without being preoccupied with optimization friendliness, since it can be enhanced rather easily. The code will be available at https://github.com/swintransformer/feature-distillation.
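A minimal sketch of feature distillation as a post-processing step: a freshly trainable student is optimized to reproduce the frozen teacher's token features. Whitening the targets with a parameter-free layer normalization and using a smooth-L1 loss are assumptions about the exact recipe.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(student_feats, teacher_feats):
    """student_feats, teacher_feats: [B, N, D] token features from student and
    frozen teacher networks of the same architecture."""
    with torch.no_grad():
        # normalize ("whiten") the teacher targets per token, without learned affine
        target = F.layer_norm(teacher_feats, teacher_feats.shape[-1:])
    return F.smooth_l1_loss(student_feats, target)
```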
Despite the tantalizing success in a broad range of vision tasks, transformers have not yet demonstrated an ability on par with ConvNets in high-resolution image generative modeling. In this paper, we seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis. To this end, we believe that local attention is crucial to strike a balance between computational efficiency and modeling capacity. Hence, the proposed generator adopts the Swin Transformer in a style-based architecture. To achieve a larger receptive field, we propose double attention, which simultaneously leverages the context of the local and the shifted windows, leading to improved generation quality. Moreover, we show that offering the knowledge of absolute positions, which is lost in window-based transformers, greatly benefits generation quality. The proposed StyleSwin is scalable to high resolutions, with both coarse geometry and fine structures benefiting from the strong expressivity of transformers. However, blocking artifacts occur during high-resolution synthesis because performing local attention in a block-wise manner may break spatial coherency. To solve this, we empirically investigate various solutions, among which we find that employing a wavelet discriminator to examine the spectral discrepancy effectively suppresses the artifacts. Extensive experiments show the superiority over prior transformer-based GANs, especially at high resolutions, e.g., 1024x1024. Without complex training strategies, StyleSwin outperforms StyleGAN on CelebA-HQ 1024 and achieves on-par performance on FFHQ-1024, demonstrating the promise of using transformers for high-resolution image generation. Code and models will be available at https://github.com/microsoft/styleswin.
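A minimal sketch of the wavelet idea on the discriminator side: a Haar wavelet transform decomposes an image into low- and high-frequency sub-bands so the discriminator can examine spectral discrepancies such as blocking artifacts. How the sub-bands are consumed by the discriminator is an assumption; only the transform itself is shown.

```python
import torch
import torch.nn.functional as F

def haar_wavelet(x):
    """x: [B, C, H, W] with even H, W -> [B, 4*C, H/2, W/2] (LL, LH, HL, HH per channel)."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])     # low-pass (average)
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])   # horizontal detail
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])   # vertical detail
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])   # diagonal detail
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)          # [4, 1, 2, 2]
    c = x.shape[1]
    kernels = kernels.repeat(c, 1, 1, 1).to(device=x.device, dtype=x.dtype)
    return F.conv2d(x, kernels, stride=2, groups=c)               # depthwise, per channel
```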